kOps K8s Control Plane Monitoring with Datadog

Xing Du
5 min read · Sep 15, 2022

Context

Due to a lack of control plane observability, I recently re-integrated the Datadog helm chart on our kops-provisioned k8s cluster. kops is definitely not the most popular k8s solution, and the official control plane monitoring guide doesn't cover detailed steps for it.

Throughout the process, I ran into one major issue (details later), potentially caused by a compatibility problem between kops and datadog-agent. The investigation kept me busy, and I still don't have a definitive answer on how to "fix" it. However, I came up with a way to bypass the issue and ensure full visibility coverage for the control plane.

Overview

A k8s control plane has 4 major components:

  • kube-apiserver
  • etcd
  • kube-scheduler
  • kube-controller-manager

all of which are supported by native Datadog integrations (which come with datadog-agent). The recommended integration guide relies on Kubernetes integration autodiscovery, but that does not work on kops-provisioned control planes.

I’ll walk through the issue and findings, and follow up with a step-by-step guide on how to bypass it.

The details covered here are based on the following setup:

For brevity, I’ll refer to “kops-provisioned control plane node(s)” as “control node(s)” unless explicitly specified.

Problem

I’ll use kube-scheduler to illustrate the problem (the same issue applies to all 4 components).

Example integration (values.yaml for datadog helm chart):

datadog:
  apiKey: <DATADOG_API_KEY>
  ...
  ignoreAutoConfig:
    - kube_scheduler
  ...
  confd:
    kube_scheduler.yaml: |-
      ad_identifiers:
        - kube-scheduler
      instances:
        - prometheus_url: https://%%host%%:10259/metrics
          ssl_verify: false
          bearer_token_auth: true

This is the recommended approach from the official control plane monitoring guide, and it relies on Kubernetes integration autodiscovery.

On a control node, the configuration above does NOT turn on the integration(s), as the check below shows:

  • a valid configuration file for the integration exists under /etc/datadog-agent/conf.d/
  • the integration is not reported as running in the agent status output
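
For example, you can confirm both points by exec-ing into the agent pod scheduled on the control node (the namespace and pod name below are placeholders for your deployment; the container is named agent in the datadog helm chart daemonset):

# The integration's config file is present...
kubectl exec -n <datadog-namespace> <agent-pod-on-control-node> -c agent -- ls /etc/datadog-agent/conf.d/
# ...but the check never shows up as running in the agent status output.
kubectl exec -n <datadog-namespace> <agent-pod-on-control-node> -c agent -- agent status | grep -i -A 3 kube_scheduler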

Investigation and Findings

The first thing I checked was the configuration content itself: I walked through the helm chart values reference and did not see anything wrong.

I’ve previously set up the Datadog helm chart for control plane monitoring on EKS clusters and on docker-desktop / minikube. There, the identical configuration doesn't work 100%, but at least the integrations are detected correctly via autodiscovery. The container names I saw when running docker ps on the control nodes have the right short name & image name (which datadog-agent uses to derive the ad_identifier), so I'm confident the configuration (especially the ad_identifiers section) is not the problem.

The next thing I did was turn on debug logging (datadog.logLevel: debug; logs available at /var/log/datadog/agent.log) for the datadog helm chart on both my kops cluster and a docker-desktop / minikube cluster. From the debug logs I pieced together roughly how datadog-agent autodiscovery works:

  • file-based configurations (/etc/datadog-agent/conf.d/) are loaded into memory, and running containers & processes are detected
  • each detected container/process gets an identifier, which is compared against the configurations of integrations that have autodiscovery turned on (via ad_identifiers)
  • once an ad_identifier matches, the rest of that yaml configuration is used for the integration.

Each step of the process above can be verified from the debug log.
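
For example, on a control node you can grep the debug log for the scheduler and for the identifiers the agent discovered (the log path comes from the debug setup above; exact log wording varies by agent version, so adjust the patterns as needed):

# Run inside the agent container on a control node.
grep -iE "kube[-_]scheduler" /var/log/datadog/agent.log
grep -i "docker://" /var/log/datadog/agent.log | less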

On a control node, the desired container (kube-scheduler; same for the other 3 components) is NOT identified as kube-scheduler. I noticed many containers were identified by container id (in the format "docker://<container_id>"), but none of those container ids matched the actual container id of kube-scheduler (you can find the container id with kubectl describe pod/<kube-scheduler-pod-name>, or by SSHing to the control node and running docker ps).
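
A quick way to cross-check that container id (the pod name below follows the static-pod naming on my cluster; adjust for yours):

# Container id as Kubernetes reports it
kubectl describe pod -n kube-system kube-scheduler-<control-node-name> | grep "Container ID"
# Container id as Docker on the control node reports it (run over SSH on the node)
docker ps --no-trunc | grep kube-scheduler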

Either kube-scheduler (same for the other 3 components) is not detected at all, or it is detected under a container id that doesn't match its own.

This is where I realized there was nothing further actionable with this approach. Fortunately, my goal is to get the integrations working for control nodes one way or another, and I was able to come up with an alternative solution.

Solution

The TL;DR version of the solution: use file-based configuration without autodiscovery.

Integrations are driven by configuration files (located under /etc/datadog-agent/conf.d/). The helm-native approach mentioned above works by converting the datadog.confd key-value pairs into one auto_conf.yaml per integration. The non-helm way to configure an integration is to provision your own conf.yaml files.
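
Concretely, the agent loads one conf.yaml per integration directory under conf.d/; the rest of this section provisions exactly these four files through a ConfigMap:

/etc/datadog-agent/conf.d/
  etcd.d/conf.yaml
  kube_apiserver_metrics.d/conf.yaml
  kube_controller_manager.d/conf.yaml
  kube_scheduler.d/conf.yaml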

To bypass the autodiscovery issue on a kops-provisioned cluster, we can:

  • provision a ConfigMap with the desired configurations
  • mount the ConfigMap as volume(s) into datadog-agent: agents.volumes + agents.volumeMounts
  • replace template variables with ones that resolve
  • disable auto-config (a synonym for “autodiscovery”) for the integrations: datadog.ignoreAutoConfig

Datadog configuration k8s ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-datadog-configmap
data:
  kube_apiserver_metrics.yaml: |+
    init_config:
    instances:
      - prometheus_url: https://%%env_DD_KUBERNETES_KUBELET_HOST%%:443/metrics
        tls_verify: false
        bearer_token_auth: true
        bearer_token_path: /var/run/secrets/kubernetes.io/serviceaccount/token
  etcd.yaml: |+
    init_config:
    instances:
      # etcd-manager-main
      - prometheus_url: "https://%%env_DD_KUBERNETES_KUBELET_HOST%%:4001/metrics"
        tls_verify: false
        tls_cert: /host/etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt
        tls_private_key: /host/etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.key
      # etcd-manager-events
      - prometheus_url: "https://%%env_DD_KUBERNETES_KUBELET_HOST%%:4002/metrics"
        tls_verify: false
        tls_cert: /host/etc/kubernetes/pki/etcd-manager-events/etcd-clients-ca.crt
        tls_private_key: /host/etc/kubernetes/pki/etcd-manager-events/etcd-clients-ca.key
  kube_scheduler.yaml: |+
    init_config:
    instances:
      - prometheus_url: "http://%%env_DD_KUBERNETES_KUBELET_HOST%%:10251/metrics"
        ssl_verify: false
  kube_controller_manager.yaml: |+
    init_config:
    instances:
      - prometheus_url: "http://%%env_DD_KUBERNETES_KUBELET_HOST%%:10252/metrics"
        ssl_verify: false

Explanation

  • template variables are specific to the autodiscovery feature. In non-autodiscovery configurations, not all of them can be resolved; e.g. %%host%% does not resolve. Fortunately, %%env_<ENV_VAR>%% seems to resolve fine.
  • kops provisions 2 etcd clusters: main and events. Two instances of the etcd integration are required, with slightly different tls_cert and tls_private_key values (although I've verified these are interchangeable).
  • kops uses etcd-manager as the parent process for etcd. Ports 2380/2381 are for peer (server-to-server) communication, and 4001/4002 are for client (client-to-server) communication. Since the agent acts as a "client" of the etcd server, ports 4001/4002 are the right choice (instead of port 2379 in a vanilla etcd setup); a quick curl check against these ports is sketched after this list.
  • kube-scheduler serves HTTP on port 10251 and HTTPS on port 10259
  • kube-controller-manager serves HTTP on port 10252 and HTTPS on port 10257
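
As a sanity check (outside of the agent itself), you can hit the etcd-manager-main client endpoint from a control node with the same certificates the integration uses; note these are the host paths, not the /host/... paths seen inside the agent container:

# Run on a control node; swap port 4002 and the etcd-manager-events paths for the events cluster.
curl -sk \
  --cert /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt \
  --key /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.key \
  https://127.0.0.1:4001/metrics | head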

values.yaml for datadog helm chart

datadog:
  ignoreAutoConfig:
    - etcd
    - kube_scheduler
    - kube_controller_manager
    - kube_apiserver_metrics
agents:
  volumes:
    - name: my-config
      configMap:
        name: my-datadog-configmap
    - name: etcd-pki
      hostPath:
        path: /etc/kubernetes/pki
  volumeMounts:
    - name: etcd-pki
      mountPath: /host/etc/kubernetes/pki
      readOnly: true
    - name: my-config
      mountPath: /etc/datadog-agent/conf.d/kube_apiserver_metrics.d/conf.yaml
      subPath: kube_apiserver_metrics.yaml
    - name: my-config
      mountPath: /etc/datadog-agent/conf.d/etcd.d/conf.yaml
      subPath: etcd.yaml
    - name: my-config
      mountPath: /etc/datadog-agent/conf.d/kube_scheduler.d/conf.yaml
      subPath: kube_scheduler.yaml
    - name: my-config
      mountPath: /etc/datadog-agent/conf.d/kube_controller_manager.d/conf.yaml
      subPath: kube_controller_manager.yaml

Explanation

  • Certificates and private keys (located under /etc/kubernetes/pki on the host) are required for etcd client-to-server communication, which is why that directory is mounted read-only into the agent at /host/etc/kubernetes/pki.
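
With the ConfigMap and values in place, a minimal rollout looks something like this (release name, namespace, and file names are placeholders for whatever your setup uses):

# Apply the ConfigMap, then upgrade the Datadog release with the values above.
kubectl apply -n <datadog-namespace> -f my-datadog-configmap.yaml
helm upgrade --install datadog datadog/datadog -n <datadog-namespace> -f values.yaml

# Verify on an agent pod scheduled on a control node: all four checks should
# now be reported as running in the agent status output.
kubectl exec -n <datadog-namespace> <agent-pod-on-control-node> -c agent -- agent status | grep -E "etcd|kube_scheduler|kube_controller_manager|kube_apiserver_metrics"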
